Overview
LLMs are commonly used to rewrite or make stylistic changes to text. The goal of this competition is to recover the LLM prompt that was used to transform a given text.

Description
NLP workflows increasingly involve rewriting text, but there's still a lot to learn about how to prompt LLMs effectively. This machine learning competition is designed to be a novel way to dig deeper into this problem.

The challenge: recover the LLM prompt used to rewrite a given text. You’ll be tested against a dataset of 1300+ original texts, each paired with a rewritten version from Gemma, Google’s new family of open models.

Evaluation
Evaluation Metric
For each row in the submission and the corresponding ground truth, sentence-t5-base is used to compute embedding vectors. The score for each predicted / expected pair is the Sharpened Cosine Similarity (SCS) of the two embeddings, using an exponent of 3. SCS is used to attenuate the generous scores that embedding vectors otherwise give to incorrect answers. Do not leave any rewrite_prompt blank, as null answers will throw an error.
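The effect of the exponent can be seen in a minimal sketch. It assumes SCS here means the cosine similarity raised to the given power with its sign preserved; the exact scoring code is not given on this page, and the function name and the toy 2-d vectors (standing in for sentence-t5-base embeddings) are illustrative only.

```python
import math


def sharpened_cosine_similarity(u, v, exponent=3):
    """Cosine similarity of u and v, sharpened by raising it to `exponent`.

    Assumed form: sign(cos) * |cos| ** exponent.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Keep the sign of the cosine and sharpen only its magnitude.
    return math.copysign(abs(cos) ** exponent, cos)


# Identical directions keep a perfect score.
score_same = sharpened_cosine_similarity([1.0, 0.0], [1.0, 0.0])
# A 45-degree mismatch has cosine ~0.707, which sharpening reduces to ~0.354.
score_45 = sharpened_cosine_similarity([1.0, 1.0], [1.0, 0.0])
print(score_same, score_45)
```

A pair that would look fairly similar under plain cosine similarity is penalized much more heavily once cubed, which is the attenuation described above.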

Submission File
The submission file should contain a header and have the following format:

id,rewrite_prompt
000aaa,"Rewrite this essay but do it using the writing style of Dr. Seuss"
111bbb,"Rewrite this essay but do it using the writing style of William Shakespeare"
222ccc,"Rewrite this essay but do it using the writing style of Tupac Shakur"
...
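A file in this format could be written along the following lines. This is a sketch only: `write_submission` and `FALLBACK_PROMPT` are hypothetical names, and the fallback text is an arbitrary placeholder chosen so that no rewrite_prompt is left blank.

```python
import csv

# Hypothetical default used when a prediction is empty; not from the competition.
FALLBACK_PROMPT = "Rewrite this text."


def write_submission(predictions, path="submission.csv"):
    """Write (id, rewrite_prompt) pairs under the required header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields containing commas or quotes
        writer.writerow(["id", "rewrite_prompt"])
        for row_id, prompt in predictions:
            # Blank answers raise a scoring error, so substitute the fallback.
            prompt = (prompt or "").strip() or FALLBACK_PROMPT
            writer.writerow([row_id, prompt])


write_submission([
    ("000aaa", "Rewrite this essay but do it using the writing style of Dr. Seuss"),
    ("111bbb", ""),  # empty prediction, replaced by the fallback
])
```

The default csv quoting only adds quotes where a field needs them, which standard CSV readers parse identically to the fully quoted example above.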



Timeline
All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

Prizes
$200,000 USD total prize pool

1st place: $100,000
2nd place: $40,000
3rd place: $20,000
4th place: $14,000
5th place: $11,000
6th place: $10,000
7th place: $5,000
In addition to the monetary prizes, we will also have special Kaggle merchandise available for 25 Kagglers who do all of the following: (1) upload a model variation to Kaggle Models; (2) attach that model to a public notebook that makes a successful submission to this competition; and (3) nominate themselves for recognition by submitting this Google Form.


Code Requirements


This is a Code Competition
Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 9 hours run-time
GPU Notebook <= 9 hours run-time
Internet access disabled
Freely & publicly available external data is allowed, including pre-trained models
Submission file must be named submission.csv
Submission runtimes have been slightly obfuscated. If you repeat the exact same submission you will see up to 15 minutes of variance in the time before you receive your score.
Please see the Code Competition FAQ for more information on how to submit, and review the code debugging doc if you encounter submission errors.

Citation
Will Lifferth, Paul Mooney, Sohier Dane, and Ashley Chow. LLM Prompt Recovery. https://kaggle.com/competitions/llm-prompt-recovery, 2024. Kaggle.

Dataset Description
The competition dataset comprises text passages that have been rewritten by the Gemma 7b-it LLM with undisclosed prompts. The goal of the competition is to determine what prompts were used.

Please note that this is a Code Competition. When your submission is scored, this example test data will be replaced with the full test set. Expect roughly 1,400 original texts in the test set.

Files
[train/test].csv

id - A unique identifier for the row.
original_text - The original text that Gemma was asked to rewrite.
rewrite_prompt - The target column. The prompt provided to Gemma.
rewritten_text - The output from Gemma.
sample_submission.csv - A submission file in the correct format.

id
rewrite_prompt
Notes
Only one example is provided in both train.csv and test.csv.
You should generate additional data to train your model against.
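One possible way to scaffold such additional data is to pair seed texts with candidate rewrite prompts and record the model output for each pair, using the same columns as train.csv. In the sketch below every name is hypothetical, the seed texts are made up, and `placeholder_rewrite` is only a stand-in for a real call to an instruction-tuned model such as Gemma 7b-it.

```python
import itertools


def placeholder_rewrite(prompt, text):
    # Stand-in only: replace with a real model call to get a genuine rewrite.
    return f"[{prompt}] {text}"


def build_training_rows(texts, prompts, rewrite_fn):
    """Build one row per (text, prompt) pair using the train.csv column names."""
    rows = []
    for i, (text, prompt) in enumerate(itertools.product(texts, prompts)):
        rows.append({
            "id": f"synth{i:05d}",  # synthetic id scheme, chosen for illustration
            "original_text": text,
            "rewrite_prompt": prompt,
            "rewritten_text": rewrite_fn(prompt, text),
        })
    return rows


seed_texts = ["The library opens at nine.", "Rain is expected tomorrow."]
seed_prompts = [
    "Rewrite this essay but do it using the writing style of Dr. Seuss",
    "Rewrite this essay but do it using the writing style of William Shakespeare",
]
synthetic_rows = build_training_rows(seed_texts, seed_prompts, placeholder_rewrite)
print(len(synthetic_rows), synthetic_rows[0]["id"])
```

Taking the full cross product keeps the prompt distribution balanced across texts; sampling a subset of pairs is an equally reasonable choice when generation is expensive.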